1. Introduction

Our project explores the relationships between hotels and restaurants located at Times Square, based on analyzing data from “NYC Open Data Source” and Yelp. Current applications in the market might focus solely on hotels or restaurants, but we would like to see if there exists interactive correlations between hotels and restaurants surrounded in terms of price and rating. The results might have an influence on travelers’ hotel choice when they travel, especially for the top NYC attraction and landmark-Times Square.

Group responsibilities are listed as follow: Pin-Hao Chen—data preprocessing and interactive component designing; Youyang Liu—data exploration and data visualization; Jolie Tan—data preprocessing, data exploration, and data visualization; Anna Zhou—data exploration and interactive component implementing.

2. Description of Data

We download Times Square Food & Beverage Locations and Times Square Hotels datasets from NYC Open Data (https://opendata.cityofnewyork.us/) and fetch Yelp data through Yelp API. The data binding and data wrangle processes are described in following sections.

2.1. Data Cleaning, Yelp Combining, and Calculating

(R script: https://github.com/pinhao1994/best-at-times-square/blob/master/Data_Preprocess.Rmd)

2.1.1. Data Cleaning

The first step would be formating the phone numbers to be a sequence of integer numbers by replacing the special characters, like “(”, “)”, and “-”.

Secondly, removing “Website”, “Location.1”, and “Borough” columns in both datasets due to following reasons.

  1. Website column: there are many missing values in the column, and all those website url links give us pretty limited information. For example, all the Starbucks have the same official website in our original dataset. This is not helping to separate or to compare which Starbucks will be better according to the link.

  2. Location.1 column: this column contains tuple of latitude and longitude which is duplicated since there are already two other columns containing the same information.

  3. Borough column: Since all of our data points are around Times Square, they have same borough – Manhattan – which is not informative.

Because Food and Beverage Data is more complicated, there are more following steps would be working on. Third, replacing special unicode character such as “Café / Deli” with “Deli” in the “Subindustry” column. Forth, substituting the value in “Subindustry” with the value in “Sub.Subindustry” column, if the former value is “Quick Serve” or “Full Serve”. Finally, removing “Sub.Subindustry” column from the dataset since we have combined it with the Subindustry column.

2.1.2. Yelp Combining

Yelp API allows us to grab info to fetch the “Yelp Category”, “Yelp Rating”, and “Review Count on Yelp”. The program we setup would search stores (hotels and restaurants) by the Phone Number; if there was no matches by searching phone numbers, the program would search hotels and restaurants by the combination of other features, such as Address, category, etc.; after the previous two steps, if the program cannot find the corresponding objects, it would recommend 5 candidates for users to manually decide the Yelp store for preprocessing.

2.1.3. Caculating

We deployed levenshtein distance to find the the neighbor of each hotel or restaurant within \(100^2\pi\) square meters.

Moreover, averaged the Yelp ratings, prices, and review counts of all neighbourhood restaurants for each hotels.

We also add an index to calculate the score. If a restaurant is rated with 4 stars but only having 2 persons rate it, the rating of the restaurant is not reliable. The formula would be (60% x review counts)*(40% x rating).

2.2. Web Interactive: GeoJson Processing

(R Script: https://github.com/pinhao1994/best-at-times-square/blob/master/Data_Preprocess_Map.Rmd)

In order to perform the web interactive part, we pre-calculate and format the data into self-defined form.

For Hotel data, we add two more columns top5cat and top3restaurant to store further information as follow:

  1. top5cat: Within \(100^2 \pi\) square meters, counting what would be the top 5 categories with respect to each hotel. Store the information in the form of “\(Category_A:Count_A\) | \(Category_B:Count_B\) |…”.

  2. top3restaurant: Within \(100^2 \pi\) square meters, counting what would be the top 3 restaurants with respect to each hotel by our designed importance weight. Store the information in the form of “\(food\_id_A\) | \(food\_in_B\) | \(food\_id_C\)”.

For Restaurant data, we add one more column called Category2ndlevel. We would like to have our upper level categories to be classified according to continents. So we have Asian, Mid East, North American, South American. Also, we keep other categories, Steakhouse, Coffee, Pizza, Deli, and Other. Also, as for those extra categories, it is ambiguous to classify them to the continental categories, but if we group them into Other, that would cause Other overweighted.

Last, we transformed csv to geojson by online transformer (https://mygeodata.cloud/converter/csv-to-geojson). Geojson type of data is prepared for the interactive web.

3. Data Analaysis Quality

3.1. General Distribution

We want to explore the restuarant data gerenally in rating, review, postcode, price, category and missing data.

library(tidyverse)
library(ggplot2)
food<-read.csv('data/food_processed.csv')

# plot rating distribution
## from the histgram we know that most of our resturant are rated as 3.0~4.0 and there is no resturant rated as 5. And it's a left-skewed data
ggplot(data = food, aes(food$Rating)) + geom_histogram(binwidth = 0.5, color = "lightblue") + labs(x="Rating")

From the histgram we know that most of our resturant are rated as 3.0~4.0 and there is no resturant rated as 5. And it’s left-skewed.

# plot postcode distribution

ggplot(data = food, aes(factor(food$zip_code))) + geom_bar() + labs(x="Zip Code")

From the histgram we can see that most of our resturants are located in 10036 and 10019 district. The restuarant in 10011, 10012, 10033 might be outliers or out of boarder.

# plot price distribution

ggplot(data = food, aes(food$Price)) + geom_bar(color = "lightblue") + labs(x="Price")

We can see that the majority resturants have price on level 2 and no resturant on level 5. Since our price data are shown as dollar sign from yelp API, the data point of 2.5 might be an error.

# plot review distribution 

ggplot(data = food, aes(food$Review_Count)) + geom_histogram() + labs(x="Review Count")

From the plot we can see that most of our restuarants review are below 1,000. The higher the review is, the fewer resturants there are. And there are some outliers that have more than 1,000 ratings.

## from this plot we can see that most common category are American, Deli, Italian, Pizza, and steakhouse.
ggplot(data = food,aes(food$Category_data)) +
  geom_bar()+coord_flip() + labs(x="Original Category")

From this plot we can see that, there are 35 categories in total. The most common category are American, Deli, Italian, Pizza, and steakhouse.

ggplot(data = food,aes(food$Category_2nd_Level)) +
  geom_bar()+coord_flip() + labs(x="Self-defined Category")

According to the second level category, we can see that North American, Deli and Europe are most common 3 category.

3.2. Missing Pattern

#library(tidyverse)
#library(ggplot2)
#library(dplyr)
#library(tidyr)
#library(DAAG)
food <- read.csv('data/food_processed.csv', header=T, na.strings=c("", "NA"))

row.names(food) <- food$ID

tidyfood <- food %>%
  rownames_to_column("id") %>%
  gather(key, value, -id) %>%
  mutate(missing = ifelse(is.na(value), "yes", "no"))

ggplot(tidyfood, aes(x = key, y = fct_rev(id), fill = missing)) +
  geom_tile(color = "white") + 
  ggtitle("food data with missing values") +
  #scale_fill_viridis_d() + # discrete scale
  theme_bw() + labs(x="Columns", y="ID")

It’s obvious that there are 3 missing pattern in our data. The most common ones are Price, Rating, Review count and Score (the Score is calculated from Price and Rating). And the second common feature is missing category data. The third pattern is missing them both. The first situation is because the data in yelp are missing.

4. Main Analysis

# plot rating distribution fill by district(postcode)

ggplot(data = food, aes(food$Rating, fill=factor(food$zip_code))) + 
  geom_histogram(aes(x= food$Rating,y = (..count..)/sum(..count..)), 
                 breaks=c(0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0),
                 position = 'dodge') + 
  labs(x="Rating", y="Density", fill="Zip Code")

We can see that the resturant’s rating in district 10036 has a peak around 3.5. And the resturant’s rating in distrct 10019 has a peak around 3. Both of the 10036 and 10019 distributions are left skewed.

## For price, there is no much difference in 10019 and 10036, both of the districts have 2 price resturants.
#ggplot(data = food, aes(food$Price, fill=factor(food$Postcode))) + geom_histogram(aes(x= food$Price), binwidth = 1,position = 'dodge')
ggplot(data = food, aes(food$Price, fill=factor(food$zip_code))) + 
  geom_histogram(aes(x= food$Price,y = ..density..),binwidth = 1,position = 'dodge') +
  labs(x="Price", fill="Zip Code")

The above graph shows that most of the restaurants in the area have Price level 2. District 10019 and 10036 seem to have similar price distribution pattern.

And we can see that 10011, 10012, 10033 only have 1 retuarant in that district, so the density will always be 1.

## plot the review based on postcode.
ggplot(data = food, aes(food$Review_Count, fill=factor(food$zip_code))) + 
  geom_histogram(position = 'dodge',binwidth = 50) + 
  xlim(0, 800) +
  labs(x="Review Count", fill="Zip Code")

## we draw some boxplot to explore the data.
## we find that restuarnts in 10036 are more likely to have more reviews. 

ggplot(food, aes(factor(food$zip_code), food$Review_Count)) +
  geom_boxplot() + 
  coord_flip() +
  labs(x="Zip Code", y="Review Count")

For review count, we find that restuarnts in 10036 are more likely to have more reviews. Restaurants located at 10019 comes the second. The restuarant in 10018 are more sparse.

## hex plot or scatter plot

ggplot(food,aes(x = Rating,y = Review_Count))+stat_bin_hex()+labs(y="Review Count")

ggplot(food,aes(x = Rating,y = Review_Count))+geom_point(position = 'jitter', alpha = .3)+labs(y="Review Count")

From the plot we can see that,

  1. there is a cluster around rating 3~4 and review less that 500.

  2. the more review, there is more likely to be high ratings.

  3. There is no resturant have more than 600 review that has low ratings.

  4. The resturant having over 1,000 reviews are outliers and they are all rating 3.5 or 4.0

# we explore the relationship between price and review

ggplot(food,aes(x = Price,y = Review_Count))+stat_bin_hex()+labs(y="Review Count")

ggplot(food,aes(x = Price,y = Review_Count))+geom_point(position = 'jitter', alpha = .3)+labs(y="Review Count")

  1. There is a cluster around Price 1-2 and review below 500.

  2. Outliers are the points over 1000 reviews and over 3 ratings.

data_new<-subset(food,food$zip_code %in% c(10036,10019,10018,10020))
data_new<- subset(data_new,data_new$Category_2nd_Level %in% c('North American','Deli','Europe','Asian'))


counts3 <- data_new %>% drop_na(`zip_code`,`Category_2nd_Level`)%>%group_by(data_new$`Category_2nd_Level`,  data_new$`zip_code`) %>% summarize(Freq =n())

colnames(counts3)<-c('Category_data','zip_code','Freq')
factor_cat <- factor(counts3$Category_data)

vcd::mosaic(factor_cat~zip_code, direction = c('v','h'), counts3, rot_labels=c(0,90,0,0),
            offset_varnames = c(left = 5), offset_labels = c(left = 0),
            just_labels = c("center","right"), margins = c(left = 5), set_varname=c(zip_code="Zip Code", factor_cat="Category"))

From the plot we can see that the majority data lies in 10019 and 10036. And district will influence the retuarants category. We can see that there are more North American restuarants in 10036, and relatively less other types retuarants.

Parallel Coordinate

library(GGally)
library(ggplot2)
library(tidyverse)

parallel_data <-food[c(4,5,6,7,8,9,11)]

parallel_data$Category_data<-factor(parallel_data$Category_data)
#parallel_data$zip_code<-factor(parallel_data$zip_code)
parallel_data$Category_2nd_Level<-factor(parallel_data$Category_2nd_Level)
parallel_data$Price<-factor(parallel_data$Price)

#ggparcoord(parallel_data ,alphaLines = .7, groupColumn = 'Category_2nd_Level', scale = "uniminmax")+ylab('Data')+xlab('Indicator')
#ggparcoord(parallel_data ,alphaLines = .7, groupColumn = 'Category_2nd_Level', scale = "globalminmax")+ylab('Data')+xlab('Indicator')
#ggparcoord(parallel_data ,alphaLines = .7, groupColumn = 'Category_2nd_Level', scale = "robust")+ylab('Data')+xlab('Indicator')
#ggparcoord(parallel_data ,alphaLines = .7, groupColumn = 'Category_2nd_Level', scale = "std")+ylab('Data')+xlab('Indicator')
food_pl <- as.data.frame(food[, c(2,4,5,6,7,8,9,10,11)])
food_pl$Score <- round(food_pl$Score)
food_pl$Rating <- round(food_pl$Rating)
food_pl[, 1:9] <- lapply(food_pl[, 1:9], factor)
food_plna <- na.omit(food_pl, cols=c("Score"))
colnames(food_plna)
## [1] "ID"                 "Category_data"      "Category_2nd_Level"
## [4] "Rating"             "Review_Count"       "Price"             
## [7] "Street_Num"         "zip_code"           "Score"
food_al <- food_plna %>% 
  #drop_na(Score) %>%
  group_by(Category_2nd_Level, Rating, Price, Street_Num, zip_code, Review_Count,Category_data) %>% #
  summarise(Freq = n())

library(alluvial)
pal <- RColorBrewer::brewer.pal(10, "Set3")
alluvial(food_al[, c("Category_2nd_Level", "Rating", "Price", "zip_code")], freq = food_al$Freq, 
         blocks = TRUE,
         alpha = 0.8,
         col = pal[match(food_al$Category_2nd_Level,
                       unique(food_al$Category_2nd_Level)) ])  

The majority of the restaurants belongs to North American food, and then comes European style restaurants and Deli. Most of the North American and European restaurants have ratings at 4 and above, and most of those restaurants are not too expensive having ratings at level 2 (around 20 to 30 dollars per person per meal). Asian restaurants comes to the forth biggest group in our data. Their ratings range from 2 to 4. More than half of the Asian are ranked at level 4. Approximately, 95% of the Asian restaurants set the price levels within two dollar signs. Coffee shops are usually with low ratings and low cost. Steakhouse are all ranked high and the price level are also high. More than 50% of the restaurants are located in zip code 10036 area. Their categories are mainly North American, European, coffee shop, Asian food, and steakhouse.

library(tidyverse)
library(ggplot2)

hf100 <- read.csv("data/hotel_food_100.csv")
hf <- na.omit(hf100)

ggplot(hf, aes(food_price)) +
  geom_histogram(binwidth = 1, color ="blue", fill = "lightblue") +
  facet_wrap(~hotel_price) +
  ggtitle("Price distribution of hotels and neighbourhood restaurants") +
  xlab("Dollar sign of restaurants") + 
  ylab("Counts of hotels within 100^2pi sq meters") +
  stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5, binwidth = 1) 

ggplot(hf, aes(food_rating)) +
  geom_histogram(binwidth = 1, color ="blue", fill = "lightblue") +
  facet_wrap(~hotel_rating2, nrow = 2) +
  ggtitle("Rating distribution of hotels and neighbourhood restaurants") +
  xlab("Rating level of restaurants") + 
  ylab("Counts of hotels within 100^2pi sq meters") +
  stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5, binwidth = 1) 

ggplot(hf, aes(x = food_price,y = food_review_count)) +
  geom_col(fill = "lightblue") +
  facet_wrap(~hotel_rating2, nrow = 2) +
  ggtitle("Review Counts of hotels and neighbourhood restaurants") +
  xlab("Review Counts of restaurants") + 
  ylab("Counts of hotels within 100^2pi sq meters") 

We could find some interesting facts here.

  1. The higher the rating of the restaurant it is, it is not necessary that the ratings of the neighbourhood restaurants are higher. Taking the rating level 5 as an example, there are not that many reviews nor high ratings. One of the reasons might be not that many restaurants rated as 5 or 1. A lot of the hotels are rated as level 3.

  2. There are a lot of restaurants rated as level 3, and the majority of neighbourhood restaurants around level 3 hotels are also on level 3.

  3. The restaurants in Time Square are not too expensive. The most expensive ones are with three dollar signs.

  4. The distributions of each price level of hotels are very similar. 4 dollar sign restaurants are very rare, and the proportion of 2 dollar sign restaurants is nearly 50% of the data at each level.

  5. Hotels with 2 dollar signs have the most counts in Time Square.

5. Executive Summary

6. Interactive Component: D3, jQuery, and Bootstrap

Interactive Web System: https://pinhao1994.github.io/EDAV/

We built our website by D3, jQuery, and Google Map API. In order to project our data points onto the map, we built a projector transforming longitude and latitude into svg path object. Also, Bootstrap model (https://getbootstrap.com/) was utilized to implenment responsive web design which allowed our website to become mobile friendly. There are three features that have been designed: Mouseover, Clicking, and Category-Choosing.

When users mouse over each data point, our system will show the information of that restaurant or hotel; if the data point is a hotel location (in blue color), there will also be a histogram to show the top 5 restaurant categories surround the hotel. When users click the hotel data points (blue points), the system will draw out a circle area with 100 meters as radius, and the top 3 restaurants near the hotel will be shown in yellow color. Last but not the least, users can choose restaurant categories, the restaurant with chosen category will be marked in red. For more details, please feel free to visit our website https://pinhao1994.github.io/EDAV/.

7. Conclusion

In this project, the main limitations come from the nature of our combined dataset. Since we combined NYC open data on Times Square’s hotels and restaurants with information scrapped from Yelp API and they have different recording strategies, matching each item perfectly is challenging. For the same item, its name and placed category in two resources might deviate. As a result, Yelp API does not have research results for about 5% of rows in NYC open data, even address and phone number are given. Although we have a high precision rate, there is a bottleneck for the recall rate. In addition, the restaurants’ categories defined by two resources are not uniform. For example, some are categorized by political regions, like Italian food or Japanese food, others are labeled by food type, such as café and steakhouse. Preprocessing it manually might lead to information leakage or misinterpretation, which undermines our analysis accuracy. Besides, most of our data are categorical, which limits our graphing choice and exploratory approaches. In the future, we wish to try incorporate more numeric variables in order to explore at more angles. It would also be interesting if we can extend this project to the Manhattan area or even the entire New York City. In that case, we can look at broader distance circle for each hotel and examine clustering pattern for different attributes.